Introduction

Goals:

Load Packages

library(tidyverse)
library(forcats)
library(gapminder)
library(scales)
library(plotly)

Part 1: Factor management

With the data set of your choice, after ensuring the variable(s) you’re exploring are indeed factors, you are expected to:

Drop factor / levels; Reorder levels based on knowledge from data. We’ve elaborated on these steps for the gapminder and singer data sets below.

Be sure to also characterize the (derived) data before and after your factor re-leveling:

For this section, I will be using the hurricNamed dataset, which includes data on named Atlantic Hurricanes from 1950-2005.

#str(hurricNamed)

From this we can see that the singer dataset is bivariate with one numerical variable (height) and one categorical/factor variable (voice.part).

Explore the effects of arrange(). Does merely arranging the data have any effect on, say, a figure? Explore the effects of reordering a factor and factor reordering coupled with arrange(). Especially, what effect does this have on a figure? These explorations should involve the data, the factor levels, and some figures.
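As a minimal sketch of that comparison, the example below uses gapminder as a stand-in (an assumption; the prompt leaves the data set open). The object names `americas_2007`, `arranged`, and `reordered` are made up for illustration.

```r
library(dplyr)
library(forcats)
library(gapminder)

# Small subset: the Americas in 2007, with unused country levels dropped.
americas_2007 <- gapminder %>%
  filter(continent == "Americas", year == 2007) %>%
  mutate(country = fct_drop(country))

# arrange() changes the row order only; the factor levels are untouched.
arranged <- americas_2007 %>% arrange(lifeExp)
identical(levels(arranged$country), levels(americas_2007$country))

# fct_reorder() changes the level order only; the rows stay where they were.
reordered <- americas_2007 %>%
  mutate(country = fct_reorder(country, lifeExp))
head(levels(reordered$country), 3)
```

Because ggplot2 lays out a discrete axis by level order, a plot with country on an axis should look the same for `americas_2007` and `arranged`, and only change for `reordered`.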

Elaboration for the Singer data set: if necessary, transform some of the variables in the singer_locations dataframe into factors, paying attention to which levels you introduce and their order. Consider the difference between base R's as.factor and the functions provided by forcats.

Drop 0. Filter the singer_locations data to remove observations associated with the incorrectly entered year 0. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and levels; address the number of rows and the levels of the affected factors.
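The same before-and-after pattern can be sketched on gapminder, used here as a stand-in because the singer package may not be installed (an assumption, not the assignment's data). Dropping Oceania plays the role of dropping year 0; `no_oceania` and `dropped` are illustrative names.

```r
library(dplyr)
library(forcats)
library(gapminder)

# Before: row count and number of levels in the affected factors.
nrow(gapminder)
nlevels(gapminder$continent)
nlevels(gapminder$country)

# filter() removes the rows, but the factors still carry the old levels.
no_oceania <- gapminder %>% filter(continent != "Oceania")
nrow(no_oceania)
nlevels(no_oceania$continent)

# fct_drop() removes levels that no longer occur in the data.
dropped <- no_oceania %>%
  mutate(continent = fct_drop(continent),
         country   = fct_drop(country))
nlevels(dropped$continent)
nlevels(dropped$country)
```

Base R's `droplevels()` applied to the whole data frame would be an alternative to calling `fct_drop()` on each factor column.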

Reorder the levels of year, artist_name or title. Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.
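A hedged sketch of reordering by a summary other than the median, again on gapminder rather than singer_locations (an assumption). The helper `spread()` is a hypothetical function defined here, not part of forcats.

```r
library(dplyr)
library(forcats)
library(gapminder)

asia <- gapminder %>%
  filter(continent == "Asia") %>%
  mutate(country = fct_drop(country))

# Default: levels ordered by the median lifeExp within each country.
by_median <- fct_reorder(asia$country, asia$lifeExp)

# Alternative summary: the spread (max - min) of lifeExp within each country.
spread <- function(x) max(x) - min(x)
by_spread <- fct_reorder(asia$country, asia$lifeExp, .fun = spread)

head(levels(by_median), 3)
head(levels(by_spread), 3)
```

Both factors hold the same set of levels; only their order differs, which is what changes the layout of a figure.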

Part 2: File I/O

Experiment with one or more of write_csv()/read_csv() (and/or TSV friends), saveRDS()/readRDS(), dput()/dget(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. I highly recommend you fiddle with the factor levels, i.e. make them non-alphabetical (see previous section). Explore whether this survives the round trip of writing to file then reading back in.
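One way this round trip might look, as a sketch: a grouped summary of gapminder with re-ordered continent levels, written once as CSV and once as RDS. The temporary file paths and the names `cont_summary`, `from_csv`, and `from_rds` are assumptions made for illustration.

```r
library(dplyr)
library(forcats)
library(readr)
library(gapminder)

# Derived data: mean life expectancy per continent,
# with the continent levels reordered to follow that mean.
cont_summary <- gapminder %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp)) %>%
  mutate(continent = fct_reorder(continent, mean_lifeExp))
levels(cont_summary$continent)

# Round trip 1: plain text. The column comes back as character, not factor.
csv_path <- tempfile(fileext = ".csv")
write_csv(cont_summary, csv_path)
from_csv <- read_csv(csv_path)
class(from_csv$continent)

# Round trip 2: R's binary format. The factor and its level order survive.
rds_path <- tempfile(fileext = ".rds")
saveRDS(cont_summary, rds_path)
from_rds <- readRDS(rds_path)
identical(levels(from_rds$continent), levels(cont_summary$continent))
```

After the CSV round trip, the level order would have to be rebuilt by hand, for example by calling `fct_reorder()` again.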

Part 3: Visualization design

Remake a past figure

In Homework 2, I made the following figure:

ggplot(gapminder, aes(x=pop, y=lifeExp)) +
  scale_x_log10() +
  facet_grid( ~ continent) +
  geom_point()

To improve this figure I’m going to attempt to:

  • Remove Oceania, to allow for more room for the other plots
  • Fix the x-axis so the values are readable
  • Add more labels to the y-axis
  • Change the theme to remove the grey background
  • Colour the points according to the year

gapminder %>% 
  filter(continent != "Oceania") %>% 
  ggplot(aes(x=pop / 1000000, y=lifeExp)) +
    facet_grid( ~ continent) +
    geom_point(aes(colour = year)) +
    labs(x = "Population (in millions)") +
    scale_x_log10(labels = comma_format()) +
    scale_y_continuous(breaks=10*(1:10)) +
    scale_color_distiller(
      palette = "YlGnBu", 
      direction = 1) +
    theme_light() +
    theme(axis.text.x = element_text(angle = 70, hjust = 1))

Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.

Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?

gapminder %>%
  plot_ly(
    x = ~pop, 
    y = ~lifeExp, 
    color = ~continent, 
    frame = ~year, 
    text = ~country, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers') %>%
  layout(
    xaxis = list(type = "log"))

With this animated plot we can clearly see how population size and life expectancy change over time. In general, both are increasing, but many African countries experience sharp dips in life expectancy in the ’90s. Let’s investigate this further with a line plot of life expectancy over time in African countries:

gapminder %>%
  filter(continent == "Africa") %>% 
  plot_ly(
    x = ~year, 
    y = ~lifeExp, 
    color = ~country, 
    type = 'scatter',
    mode = 'lines') 

Because this is a plotly plot, we can hover over a line to identify unambiguously which country it belongs to. We can also de-select lines so they are hidden from the graph, zoom into areas of interest, and examine single lines at a time (apparently by double-clicking, but I’ve had difficulty getting this to work reliably).
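Another route to a plotly graph, sketched here as an alternative to building it with plot_ly(), is to pass an existing ggplot object to ggplotly(). The plot below is a simplified version of the remade figure; the `text` aesthetic and the names `p` and `interactive_p` are assumptions added for illustration.

```r
library(dplyr)
library(ggplot2)
library(plotly)
library(gapminder)

# A static ggplot2 figure; `text` is only used for the plotly tooltip.
p <- gapminder %>%
  filter(continent != "Oceania") %>%
  ggplot(aes(x = pop / 1e6, y = lifeExp, colour = year, text = country)) +
    geom_point() +
    facet_grid(~ continent) +
    scale_x_log10() +
    labs(x = "Population (in millions)")

# Convert to an interactive plotly object and choose what the tooltip shows.
interactive_p <- ggplotly(p, tooltip = c("text", "x", "y"))
interactive_p
```

This keeps the ggplot2 facets and scales while adding hover labels and zooming.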

Part 4: Writing figures to file

Use ggsave() to explicitly save a plot to file. Then use Markdown image syntax to load and embed it in your report. You can play around with various options, such as:

  • Arguments of ggsave(), such as width, height, resolution or text scaling.
  • Various graphics devices, e.g. a vector vs. raster format.
  • Explicit provision of the plot object p via ggsave(…, plot = p). Show a situation in which this actually matters.
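A minimal sketch of these options follows, including a case where plot = p matters. The file names and the temporary output directory are assumptions for illustration; in a report the files would go somewhere permanent.

```r
library(ggplot2)
library(gapminder)

# The figure we actually want to save.
p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point() +
  scale_x_log10()

# A second plot created afterwards; it is now what last_plot() returns.
q <- ggplot(gapminder, aes(year, lifeExp)) +
  geom_point()

out_dir <- tempdir()

# Without plot =, ggsave() falls back to last_plot(), so this saves q, not p.
ggsave(file.path(out_dir, "last.png"), width = 6, height = 4, dpi = 150)

# With plot = p, the intended figure is saved regardless of what came last.
ggsave(file.path(out_dir, "p_raster.png"), plot = p,
       width = 6, height = 4, dpi = 150)

# Same plot in a vector format; the device is picked from the file extension.
ggsave(file.path(out_dir, "p_vector.pdf"), plot = p, width = 6, height = 4)
```

The PNG is a fixed grid of pixels at the chosen dpi, whereas the PDF should stay sharp at any zoom level.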

But I want to do more!

For this, I’d like to explore the functions of forcats further. First, let’s derive a new data frame with a new factor variable built from the country variable in the gapminder dataset. Given my incredibly limited knowledge of geography, a filtered dataset containing only the countries whose capitals I confidently know is quite manageable.

knownCapitals <- gapminder %>% 
  filter(country %in% c("Canada", 
                        "United Kingdom", 
                        "France", 
                        "United States", 
                        "Argentina", 
                        "Norway")) %>% 
  mutate(country = fct_drop(country),
         capitals = fct_recode(country,
                               Ottawa = "Canada",
                               London = "United Kingdom",
                               Paris = "France",
                               Washington = "United States",
                               `Buenos Aires` = "Argentina",
                               Oslo = "Norway"))

str(knownCapitals)
## Classes 'tbl_df', 'tbl' and 'data.frame':    72 obs. of  7 variables:
##  $ country  : Factor w/ 6 levels "Argentina","Canada",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  62.5 64.4 65.1 65.6 67.1 ...
##  $ pop      : int  17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
##  $ gdpPercap: num  5911 6857 7133 8053 9443 ...
##  $ capitals : Factor w/ 6 levels "Buenos Aires",..: 1 1 1 1 1 1 1 1 1 1 ...

Awesome, so our new data frame includes 6 countries, with a new variable for the capital of each country. The unused levels of the country variable have also been dropped, so it no longer lists countries whose capitals I don’t know.

Let’s try re-ordering these factors for the sake of a plot. Let’s plot these based on the latitude of each capital, and see if we can see a trend in life expectancy.

knownCapitals %>% 
  mutate(capitals = fct_relevel(capitals, 
                                "Oslo", 
                                "London", 
                                "Paris", 
                                "Ottawa", 
                                "Washington", 
                                "Buenos Aires")) %>% 
  ggplot(aes(capitals, lifeExp)) +
    geom_boxplot()

Cool! So now we can see that there might be a downward trend in life expectancy the further south a country’s capital is located. This is a super biased conclusion though, as it rests on only six hand-picked countries!